1 Did Somebody Say PARTI?

1.1 Overview 

PARTI stands for “Parallel Automated Runtime Toolkit at ICASE.” Development 
of PARTI has been carried out at Yale University as well as ICASE and hence has 
been referred to as “PARTY” in some earlier papers. The PARTI runtime primitives 
are designed to help users to efficiently program loops found in irregular problems 
(e.g. unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differ- 
ential equations solvers). These primitives are also designed for use in compilers for 
distributed memory multiprocessors. In the context of the PARTI project, we are 
also developing a variety of other tools including compilers for distributed machines. 
These primitives are some of the basic building blocks we are using in our efforts. 

The primitives in this distribution run on any of the iPSC/2 or iPSC/860 machines 
produced by Intel Scientific Computing. They could easily be modified to run on most 
distributed memory machines. This document describes the operation of the PARTI 
primitives and gives several examples of how to use them. The rationale of the PARTI 
system (the PARTI line, as it were) was presented in [2] and summarized in [4].
The mechanisms incorporated in these primitives have been outlined in [2], [4] and
[5]. PARTI has been used in a variety of applications, including sparse matrix linear
solvers, adaptive computational fluid dynamics codes, and a prototype compiler
[4] aimed at distributed memory multiprocessors.

1.2 Primitives Available in the Release 

The PARTI system is divided into several levels. Level 0 primitives allow proces- 
sors to access the distributed memory of a multiprocessor with a modicum of con- 
venience. Level 1 primitives bind mapping information to arrays. This allows the 
user to store and manipulate constructs that describe multiprocessor mappings of 
distributed multidimensional arrays. Included with this distribution are the level 0 
primitives outlined next. 

The level 0 scatter allows each processor of a distributed memory machine to move 
data to off-processor memory locations. The level 0 gather allows each processor to 
obtain copies of data from memory locations in other processors. Level 0 primitives 
are provided to support initialization and access of distributed translation tables. 
Such distributed tables allow a user to assign globally numbered indices to processors 
in an irregular pattern. By using a distributed translation table, it is possible to avoid
replicating records of where distributed array elements are stored in all processors.
Level 0 primitives also carry out off-processor accumulations; e.g. any processor can 
add to the contents of an off-processor memory location. 


1.3 Primitives that exist but are not yet distributed 

There are additional level 0 primitives not included with this release that support local 
caching of copies of off-processor data. These Level 0 primitives are presented in [3] 
and will be available in future PARTI releases. Level 1 primitives, also not available 
with this release, allow users to specify how distributed arrays are to be mapped 
onto sets of processors. The level 1 primitives support read, write and accumulate 
accesses to these mapped multidimensional arrays. The level 1 primitives also allow 
users to dynamically remap distributed arrays. The Level 1 primitives are described 
in [1]. Note that use of the PARTI primitives does not interfere with access
to traditional message passing communications primitives. In particular, a user can
call all of the iPSC supplied routines when using PARTI.

2 Installation 

2.1 Getting PARTI 

PARTI is available either as several shar files or as one tar file. The tar file is in
general more convenient, but the shar files can be sent through the mail. PARTI can be
obtained by anonymous ftp from ra.cs.yale.edu, from netlib, or by contacting:

Raja Das 
ICASE 

Mail Stop 132C 

NASA Langley Research Center 

Hampton, Va 06511 

(804) 864-8004 

raja@icase.edu

If you have the PARTI tar file, just change to the directory where you wish to put 
the PARTI subdirectory and type: 

tar xof parti.tar

If you have the shar files, things are only mildly worse. You need the following
files: docs.shar, free.shar, matmult.shar, papers.shar, src.shar, tests.shar, unst.shar
and a makefile (called “makefile”, oddly enough). Put these files in the directory
where you want the PARTI subdirectory and type

make unshar 


2.2 Building PARTI 

Either of the above installation procedures should create the following directory struc- 
tures: 

parti/docs documentation in latex, postscript and plain text 

parti/examples/matmult sparse matrix multiplication, described in Section B

parti/examples/unst sweep over unstructured mesh, described in Section A

parti/examples/free a conjugate gradient linear equation solver (cg.c and cg_host.c)
not discussed in this documentation. (Free prize included in every copy of
PARTI!) Also included is simplex, a simple example involving several of the
primitives.

parti/papers some of the relevant papers 

parti/src source for the PARTI primitives 

parti/tests test programs to verify correct installation 

A makefile should be present in the PARTI directory. At the beginning of this 
makefile are several macros to be modified by the user. 

NFLAG This macro is passed to the C compiler and linker when compiling and/or 
linking node programs. It should have one of the following values: 

-node -sx for iPSC/2 machines with weitek floating point accelerators 
-node -i860 for iPSC/860 machines
-node for vanilla iPSC/2 machines 




NARC This macro indicates the archive to be used in creating the PARTI library. 
It should be set to one of the following: 

ar for any iPSC/2 
ar860 for an iPSC/860 

LIB This macro should be set to the directory where the PARTI library will be
installed. It is prudent to use the full path name here. This directory must exist
before the system is installed.

INCL This macro should be set to the directory where the PARTI include files will 
reside. It is prudent to use the full path name here. This directory must exist 
before the system is installed. 

NPROCS This indicates the largest number of processors that the tests should be 
run on. Eight and sixteen are good values. 

NODECC This macro should be set to the C compiler which will compile the node 
programs. The default compiler (cc) is always a correct choice. The pgcc 
compiler may also be used where appropriate. 

NODEF77 This macro should be set to the Fortran compiler to be used to compile
the node programs. The default compiler (f77) is always a correct choice. The
pgf77 compiler may be used where appropriate.

Make sure that the directories pointed to by LIB and INCL exist. If they do not, any
attempt to install the PARTI system there will fail. There are several targets to make.
Typing the following make commands in the listed order should be sufficient to install
and check the PARTI system on your computer.

make will compile the PARTI library but not install it in the designated directories.
make install will install the PARTI system in the designated directories.
make clean will remove object and executable files from various subdirectories.
make test will run several tests to see if everything has been compiled correctly.





3 Function Descriptions 

3.1 Header Files 

There are two header files which go with the PARTI library. The first is parti.h. This
file contains the definitions of all structures, macros and function declarations
needed to run the PARTI primitives. It must be included in all C programs that use
the PARTI system. The second include file, parti_more.h, is used only when the
system is compiled. It defines such things as message types and static buffer lengths.
It should not be necessary to include this file in applications which use PARTI. No
header files need be included in Fortran applications.

Two of the primitives, schedule and build_translation_table, are functions that
carry out preprocessing. schedule and build_translation_table allocate structures
of type schedule_struct and trans_table, respectively, and return pointers to these
structures. The structures are defined in parti.h; macro definitions define struct
schedule_struct as SCHED and struct trans_table as TTABLE. parti.h also defines
the macros STRIPED and BLOCKED used in the procedure build_translation_table.

3.2 Level 0 primitives 

Level 0 gathers and scatters are accomplished by using three routines: Scheduler,
Gather, and Scatter.

Scheduler on each processor j is passed a list of indices K_j into the local arrays
aloc held on other processors. Scheduler produces a schedule S that controls the data
that are to be fetched off-processor by Gather or scattered off-processor by Scatter.

On each processor, Gather inputs

1. a buffer into which the fetched elements are to be placed

2. a pointer to local array aloc

3. the schedule S produced by Scheduler

In Fig. 1 we introduce a running example to illustrate Scheduler, Gather and
Scatter. In this example we have three processors; each processor is passed a set of
off-processor indices.

Figure 1: Scheduler Example

Scheduler:

inputs list of indices on each processor
outputs a schedule S

E.g.

processor 1: (processor 2, index 5), (processor 3, index 7)
processor 2: (processor 1, indices 4, 5, 6), (processor 3, index 2)
processor 3: (processor 1, index 1), (processor 2, indices 1, 3, 4)

Gather executes sends and receives that fetch from processor j the appropriate
elements from the array aloc on processor j. Then it places these elements into
the user-supplied buffer. Fig. 2 continues the running example begun in Fig. 1. On
processor j the array aloc is initialized as aloc(i) = j * 100 + i for 1 <= i. We
depict the contents of buffer on each processor after Gather is executed.

Scatter is passed

1. a buffer from which each scattered datum is to be obtained

2. a pointer to local array aloc

3. the schedule S produced by Scheduler

Scatter executes sends and receives that put on processor j the appropriate elements
from the buffer. Then Scatter places these elements into the appropriate elements of
array aloc on processor j. Fig. 3 continues the running example. We assume that on
processor j, we initialize buffer as buffer(i) = j * 100 + i for 1 <= i; we initialize
aloc so that aloc(i) = 0. After Scatter executes, we depict, on each processor j,
the contents of aloc.

3.2.1 Functioning of the Scheduler, Gather and Scatter 

Both the procedures Scatter and Gather have three stages. They permute data into 
buffers to be sent. They perform the needed communication, then they perform 
another permutation.

Figure 2: Gather Example

Gather:

inputs schedule S produced by Scheduler
inputs pointer to local array aloc from which gathered elements are to be fetched
outputs fetched elements placed in local array buffer

E.g. assume

processor 1: aloc(i) = 100 + i , 1 <= i
processor 2: aloc(i) = 200 + i , 1 <= i
processor 3: aloc(i) = 300 + i , 1 <= i

Gather returns:

buffer   Processor 1   Processor 2   Processor 3
  1          205           104           101
  2          307           105           201
  3           -            106           203
  4           -            302           204

Figure 3: Scatter Example

Scatter:

inputs schedule S produced by Scheduler
inputs elements to be scattered, these are placed in local array buffer
outputs scattered elements, these are placed in local array aloc

E.g. assume

processor 1: buffer(i) = 100 + i , 1 <= i
processor 2: buffer(i) = 200 + i , 1 <= i
processor 3: buffer(i) = 300 + i , 1 <= i

processor 1: aloc(i) = 0, 1 <= i
processor 2: aloc(i) = 0, 1 <= i
processor 3: aloc(i) = 0, 1 <= i

After Scatter is called:

aloc     Processor 1   Processor 2   Processor 3
  1          301           302            0
  2           0             0            204
  3           0            303            0
  4          201           304            0
  5          202           101            0
  6          203            0             0
  7           0             0            102
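
The data movement shown in Figs. 1 to 3 can be reproduced with a small serial sketch. This is not PARTI code and performs no message passing: the names sim_gather, sim_scatter and struct ref are hypothetical, and the arrays of all three processors are held in one two-dimensional array so that the effect of the schedule can be checked by hand.

```c
#include <assert.h>

#define NPROC 3   /* processors in the running example */
#define NLOC  8   /* slots per local array; slot 0 is unused (1-based indices) */

/* One scheduled reference: element `index` of aloc on processor `proc`.
   Both fields are 1-based, as in Fig. 1. */
struct ref { int proc; int index; };

/* Simulated gather for one processor: copy each referenced element
   into buffer[0..n-1], in the order the references were listed. */
void sim_gather(int aloc[NPROC + 1][NLOC], const struct ref *refs, int n,
                int *buffer)
{
    int k;
    for (k = 0; k < n; k++) {
        assert(refs[k].proc >= 1 && refs[k].proc <= NPROC);
        assert(refs[k].index >= 1 && refs[k].index < NLOC);
        buffer[k] = aloc[refs[k].proc][refs[k].index];
    }
}

/* Simulated scatter for one processor: write buffer[k] into the
   element named by the k-th reference. */
void sim_scatter(int aloc[NPROC + 1][NLOC], const struct ref *refs, int n,
                 const int *buffer)
{
    int k;
    for (k = 0; k < n; k++) {
        assert(refs[k].proc >= 1 && refs[k].proc <= NPROC);
        assert(refs[k].index >= 1 && refs[k].index < NLOC);
        aloc[refs[k].proc][refs[k].index] = buffer[k];
    }
}
```

Calling sim_gather with processor 2's reference list from Fig. 1 fills the buffer with 104, 105, 106 and 302, which is the Processor 2 column of Fig. 2.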
The scheduler first determines how many messages each processor must send and
receive during the data exchange phase. Defined on processor j is an array nmsgs_j.
Processor j sets the value of nmsgs_j(i) to 1 if it needs data from processor i or to 0
if it does not. The scheduler then replaces nmsgs_j with the element-by-element sum
nmsgs_j(i) <- sum over k of nmsgs_k(i). This operation utilizes a function that imposes
a fan-in tree to find the sums. Since the resulting sum is kept in nmsgs_j on every
processor, at the end of the fan-in nmsgs_j(i) is the number of messages that processor
i must send during the exchange phase. Next, each processor sends a request list to
every other processor. The request list sent from processor p to processor q contains
the indices of data needed by processor p that are stored on processor q.

The number of non-empty request lists each processor will receive is equal to 
the number of messages that the processor will send in the gather or scatter phase. 
Each request list is placed in an array indexed by the processor from which the list 
came. When the scheduler is finished, each processor has an array of request lists 
obtained from other processors. The jth element of this array contains the request
list obtained from processor j. At this point in the execution, each processor i knows
which elements of aloc local to processor i must be sent to other processors.
This information is used to generate the schedule S of pairs of send and receive 
statements. These send/receive pairs will exchange the requested data for either a 
gather or a scatter. The gather or the scatter is passed the schedule S with the 
required buffer space. It then carries out the required communication. 
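
The message-counting step can be illustrated with a serial stand-in for the fan-in sum. This is a sketch under stated assumptions, not the scheduler's implementation: the function name count_sends and the flattened needs matrix are hypothetical, processors are numbered from 0, and an ordinary loop replaces the fan-in tree.

```c
#include <assert.h>

/* needs[j * nproc + i] is 1 if processor j needs data from processor i,
   and 0 otherwise (row j is what processor j holds before the reduction).
   On return, nsend[i] is the column sum: the number of messages that
   processor i must send during the exchange phase. */
void count_sends(const int *needs, int nproc, int *nsend)
{
    int i, j;
    assert(nproc > 0);
    for (i = 0; i < nproc; i++) {
        nsend[i] = 0;
        for (j = 0; j < nproc; j++)
            nsend[i] += needs[j * nproc + i];
    }
}
```

For the running example of Fig. 1 every processor requests data from the other two, so each column sums to 2 and every processor sends two messages.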

3.3 schedule() 

This procedure carries out the preprocessing needed for the optimized gather
exchanger and scatter exchanger routines. Every processor must participate in this
procedure call. On each processor, schedule is passed a list of processors and local
indices from which a gather procedure on that processor can later obtain data (or to
which a scatter procedure on that processor can later write data). schedule returns
a pointer to a structure of type SCHED; this pointer is used in gather, scatter and
scatter_FUNC operations (Sections 3.4, 3.5, 3.6).


Synopsis 

SCHED *schedule(local,proc,ndata)

Parameter declarations

int *local local indices to be gathered from or scattered to
int *proc processors to be gathered from or scattered to
int ndata number of data involved in gather or scatter

Return value 

Returns pointer to structure of type SCHED which can be used in PREFIXgather,
PREFIXscatter, PREFIXscatter_add, PREFIXscatter_sub, PREFIXscatter_mult.

Example 

Node 0 schedules a fetch of elements 1 and 2 from a (so far unspecified) array on 
node 1; node 1 schedules a fetch of element 1 from an array on node 0 and 0 from 
an array on node 1. 


int local[2], proc[2], ndata;
SCHED *schedinfo;

/* node 0 fetches elements 1 and 2 from node 1 */
if (mynode() == 0) {
    proc[0] = 1;
    local[0] = 1;
    proc[1] = 1;
    local[1] = 2;
    ndata = 2;
}

/* node 1 fetches element 1 from node 0 and element 0 from node 1 */
if (mynode() == 1) {
    proc[0] = 0;
    local[0] = 1;
    proc[1] = 1;
    local[1] = 0;
    ndata = 2;
}

schedinfo = schedule(local, proc, ndata);


3.4 PREFIXgather() 

PREFIX can be d (double precision), i (integer), f (floating point) or c (character).
This procedure is the gather exchanger procedure described above and in [1].
PREFIXgather uses a schedule produced by a call to schedule; the schedule is passed to
PREFIXgather in the structure SCHED *schedinfo. Copies of data values obtained from
other processors are placed in memory pointed to by buffer. Also passed to
PREFIXgather is a pointer to the location from which data is to be fetched on the calling
processor. This pointer is designated here as aloc; it corresponds to the array aloc
on processor j described above and in [1].


Synopsis 

void PREFIXgather(schedinfo, buffer, aloc)

Parameter Declarations 

SCHED *schedinfo information obtained from schedule’s preprocessing of ref- 
erence pattern 

TYPE *buffer pointer to buffer for copies of gathered data values 
TYPE *aloc location from which data is to be fetched from calling processor 

Return Value 

None 

Example

We assume that schedule has already been called with the parameters presented
in Section 3.3. Our example will assume that we wish to gather double precision
numbers, i.e. that we will be calling dgather. On each processor, aloc points to
the array from which values are to be obtained, and buffer points to the location
into which copies of data values obtained from other processors will be placed.


double buffer[2], aloc[3];
SCHED *schedinfo;
int i;

for (i = 0; i < 3; i++) {
    aloc[i] = mynode() + 0.1*i;
}

dgather(schedinfo, buffer, aloc);


On processor 0, buffer[0] and buffer[1] are now equal to 1.1 and 1.2. On processor
1, buffer[0] and buffer[1] are now equal to 0.1 and 1.0.


3.5 PREFIXscatter() 

PREFIX can be d (double precision), i (integer), f (floating point) or c (character).
This procedure is the scatter exchanger procedure described above and in [1].
PREFIXscatter uses a schedule produced by a call to schedule; the schedule is passed to
PREFIXscatter in the structure SCHED *schedinfo. Copies of data values to be scattered
to other processors are placed in memory pointed to by buffer. Also passed to
PREFIXscatter is a pointer to the location to which copies of data are to be written
on the calling processor. This pointer is designated here as aloc; it corresponds to
the array aloc on processor j described above and in [1].

Synopsis

void PREFIXscatter(schedinfo, buffer, aloc)

Parameter Declarations

SCHED *schedinfo information obtained from schedule’s preprocessing of reference
pattern

TYPE *buffer points to data values to be scattered from a given processor 

TYPE *aloc points to first memory location on calling processor for scattered 
data 

Return Value 
None 
Example 

We assume that schedule has already been called with the parameters presented
in Section 3.3. Our example will assume that we wish to scatter double precision
numbers, i.e. that we will be calling dscatter. On each processor, aloc points to
the array to which values are to be scattered, and buffer points to the location from
which the data to be scattered will be obtained. The processor and local-array
index to which the values are to be scattered were designated during an earlier call
to schedule.


double buffer[2], aloc[3];
SCHED *schedinfo;
int i;

for (i = 0; i < 3; i++) {
    aloc[i] = 10.0;
}

if (mynode() == 0) {
    buffer[0] = 444.44;
    buffer[1] = 555.55;
}

if (mynode() == 1) {
    buffer[0] = 666.66;
    buffer[1] = 777.77;
}

dscatter(schedinfo, buffer, aloc);


On processor 0, the first three elements of aloc are 10.0, 666.66 and 10.0. On
processor 1, the first three elements of aloc are 777.77, 444.44 and 555.55.


3.6 PREFIXscatter_FUNC()


PREFIX can be d (double precision), i (integer), f (floating point) or c (character).
FUNC can be add, sub or mult. PREFIXscatter stores data values to specified
locations. PREFIXscatter_FUNC allows one processor to specify computations that
are to be performed on the contents of a given memory location of another processor.
The procedure is in other respects analogous to PREFIXscatter.


Synopsis

void PREFIXscatter_FUNC(schedinfo, buffer, aloc)

Parameter Declarations 

SCHED *schedinfo information obtained from schedule’s preprocessing of ref- 
erence pattern. 

TYPE *buffer points to data values that will form operands for the specified
type of remote operation.

TYPE *aloc points to first memory location on calling processor to be used as
targets of remote operations.

Return Value
None 
Example 

We assume that schedule has already been called with the parameters presented
in Section 3.3. Our example will assume that we wish to scatter and add double
precision numbers, i.e. that we will be calling dscatter_add. On each processor,
aloc points to the array to which values are to be scattered and added, and buffer
points to the location from which the values to be scattered and added will be
obtained. The processor and local-array index to which the values are to be scattered
and added were designated during an earlier call to schedule.


double buffer[2], aloc[3];
SCHED *schedinfo;
int i;

for (i = 0; i < 3; i++) {
    aloc[i] = 10.0;
}

if (mynode() == 0) {
    buffer[0] = 444.44;
    buffer[1] = 555.55;
}

if (mynode() == 1) {
    buffer[0] = 666.66;
    buffer[1] = 777.77;
}

dscatter_add(schedinfo, buffer, aloc);


On processor 0, the first three elements of aloc are 10.0, 676.66 and 10.0. On
processor 1, the first three elements of aloc are 787.77, 454.44 and 565.55.
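
The difference between a plain scatter and the accumulating variants comes down to how each arriving operand is combined with the target location. The sketch below is hypothetical (the enum and the function apply_scatter are not part of PARTI), and it assumes that sub means "target minus operand", which this document does not state explicitly.

```c
#include <assert.h>

/* Which combining rule the receiving processor applies. */
enum scatter_op { OP_STORE, OP_ADD, OP_SUB, OP_MULT };

/* Combine one scattered operand with the contents of its target
   location. OP_STORE models PREFIXscatter; the other three model
   the add, sub and mult variants. */
void apply_scatter(double *target, double operand, enum scatter_op op)
{
    switch (op) {
    case OP_STORE: *target = operand;  break;
    case OP_ADD:   *target += operand; break;
    case OP_SUB:   *target -= operand; break;  /* assumed operand order */
    case OP_MULT:  *target *= operand; break;
    default:       assert(0 && "unknown scatter operation");
    }
}
```

With a target initialized to 10.0 and an operand of 666.66, OP_ADD leaves 676.66 in the target, matching element 1 on processor 0 in the dscatter_add example.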


3.7 build_translation_table() 

In order to allow a user to assign globally numbered indices to processors in an irregu- 
lar pattern, it is useful to be able to define and access a distributed translation table. 
By using a distributed translation table, it is possible to avoid replicating records of 
where distributed array elements are stored in all processors. The distributed table 
is itself partitioned in a very regular manner. A processor that seeks to access an
element I of an irregularly distributed data array is able to compute a simple function
that designates a location in the distributed table; the location of the actual array
element sought is obtained from the distributed table.

The procedure build_translation_table constructs a distributed translation table.
It assumes that distributed array elements are globally numbered. Each processor
passes build_translation_table a set of indices for which it will be responsible. The
distributed translation table may be striped or blocked across the processors. With
a striped translation table, the translation table entry for global index I is stored in
processor (I modulo number_of_processors); the local index of the translation table
entry is (I / number_of_processors). In a blocked translation table, translation table
entries are partitioned into a number of equal sized ranges of contiguous integers;
these ranges are placed in consecutively numbered processors. With blocked partitioning,
the block corresponding to index I is (I / B) and the local index is (I modulo B),
where B is the size of the block. Let M be the maximum global index passed to
build_translation_table by any processor and NP represent the number of processors;
then B = ceiling(M / NP).
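
The striped and blocked rules above are simple integer arithmetic, sketched below. The names table_slot, striped_slot, block_size and blocked_slot are hypothetical and do not appear in parti.h; the block size follows the formula exactly as written, taking M to be the largest global index.

```c
#include <assert.h>

/* Where the translation table entry for a global index is kept. */
struct table_slot { int proc; int offset; };

/* Striped table: entry for global index g is on processor g mod nproc,
   at local offset g / nproc (integer division). */
struct table_slot striped_slot(int g, int nproc)
{
    struct table_slot s;
    assert(g >= 0 && nproc > 0);
    s.proc = g % nproc;
    s.offset = g / nproc;
    return s;
}

/* Block size B = ceiling(max_index / nproc), in integer arithmetic. */
int block_size(int max_index, int nproc)
{
    assert(max_index >= 0 && nproc > 0);
    return (max_index + nproc - 1) / nproc;
}

/* Blocked table: entry for global index g is in block g / b,
   at local offset g mod b. */
struct table_slot blocked_slot(int g, int b)
{
    struct table_slot s;
    assert(g >= 0 && b > 0);
    s.proc = g / b;
    s.offset = g % b;
    return s;
}
```

For example, with M = 7 and two processors the block size is 4, so the entry for global index 5 is at offset 1 on processor 1 in a blocked table, and at offset 2 on processor 1 in a striped table.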

build_translation_table returns a pointer to a structure of type TTABLE; this
pointer is used in dereference, defined in Section 3.8.


Synopsis 

TTABLE *build_translation_table(part,indexarray,ndata)

Parameter Declarations

int part how translation table will be mapped - may be BLOCKED or STRIPED

int *indexarray each processor P specifies list of globally numbered indices for 
which P will be responsible 

int ndata number of indices for which processor P will be responsible 
Return Value 

structure of type TTABLE; this structure contains a given processor’s portion of 
the distributed translation table 

Example 

An example to demonstrate the use of both build_translation_table and dereference 
can be found in Section 3.8. 


3.8 dereference() 

dereference accesses the distributed translation table constructed in build_translation_table.

dereference is passed a pointer to a structure of type TTABLE; this structure de- 
fines the irregularly distributed mapping created in procedure build_translation_table. 
dereference is passed an array with global indices that need to be located in distributed 
memory; dereference returns arrays local and proc that contain the processors and 
local indices corresponding to the global indices. 


Synopsis 

void dereference(index_table, global, local, proc, ndata)

Parameter declarations 

TTABLE *index_table distributed translation table data structure created in
build_translation_table

int *global list of global indices we wish to locate in distributed memory

int *local local indices obtained from the distributed translation table that
correspond to the global indices passed to dereference

int *proc array of distributed translation table processor assignments for each
global index passed to dereference

int ndata number of elements to be dereferenced

Return value

None

Example

A one dimensional distributed array is partitioned in some irregular manner, so we
need a distributed translation table to keep track of where one can find the value
of a given element of the distributed array.

In the example below, we initialize a translation table. Processor 0 calls
build_translation_table and assigns indices 0 and 3 to processor 0; processor 1 calls
build_translation_table and assigns indices 1 and 2 to processor 1. The translation
table is partitioned between processors in blocks.

Processor 0 then uses the translation table to dereference global indices 0 and 1;
processor 1 uses the translation table to dereference global indices 2 and 3. On
each processor, dereference carries out a translation table lookup. The values of
proc and local returned by dereference are shown in Table 1. The user gets
to specify the processor to which each global index is assigned; note however that
build_translation_table assigns local indices.

Table 1: Values obtained by dereference

Processor   proc[0]   local[0]   proc[1]   local[1]
    0          0          0         1          0
    1          1          1         0          1


#include <stdio.h>
#include <stdlib.h>
#include "parti.h"

main()
{
    int size, i, *index_array;
    int *deref_array;
    int *local, *proc;
    TTABLE *table;

    size = 2;

    index_array = (int *) malloc(sizeof(int)*size);
    deref_array = (int *) malloc(sizeof(int)*size);
    local = (int *) malloc(sizeof(int)*size);
    proc = (int *) malloc(sizeof(int)*size);

    /* Assign indices 0 and 3 to processor 0 */
    if (mynode() == 0)
    {
        index_array[0] = 0;
        index_array[1] = 3;
    }

    /* Assign indices 1 and 2 to processor 1 */
    if (mynode() == 1)
    {
        index_array[0] = 1;
        index_array[1] = 2;
    }

    /* set up a translation table */
    table = build_translation_table(BLOCKED, index_array, size);

    /* Processor 0 seeks processor and local indices
       for global array indices 0 and 1 */
    if (mynode() == 0)
    {
        deref_array[0] = 0;
        deref_array[1] = 1;
    }

    /* Processor 1 seeks processor and local indices
       for global array indices 2 and 3 */
    if (mynode() == 1)
    {
        deref_array[0] = 2;
        deref_array[1] = 3;
    }

    /* Dereference a set of global variables */
    dereference(table, deref_array, local, proc, size);

    /* local and proc return the processors and local indices where
       global array indices are stored.

       In processor 0, proc[0] = 0, proc[1] = 1, local[0] = 0, local[1] = 0
       In processor 1, proc[0] = 1, proc[1] = 0, local[0] = 1, local[1] = 1
    */
}


Now assume that processor 0 needs to know the values of distributed array elements
0, 1, and 3 while processor 1 needs to know the value of element 2. We call
dereference to find the processors and the local indices that correspond to each global
index. At this point schedule can be called and gathers and scatters carried out.


3.9 localize() 

When loops access data residing off processor, some pre-processing is necessary before
these loops can be executed. The pre-processing involves setting up a schedule to bring
in the off-processor data and changing all the global references to local ones. The
primitive localize makes calls to dereference and schedule to do all the necessary
processing. The schedule pointer returned by localize is used to gather data and
store it at the end of the local array. This schedule is created such that
multiple copies of the same data are not brought in during the gather phase. The
elimination of duplicates is achieved by using a hash table. localize returns the
local reference string corresponding to the global references which are passed as a
parameter to it. The number of off-processor data elements is also returned by
localize so that one can allocate enough space at the end of the local array.


Synopsis 

void localize(tabptr,lsched,global_refs,local_refs,ndata,n_off_proc,my_size)

Parameter Declarations 

TTABLE *tabptr pointer to the distributed translation table built for the local
array being dealt with.

SCHED **lsched pointer to the data structure for schedule, which stores all the
send and receive information (returned by localize).

int *global_refs pointer to the array which stores the global reference string.

int *local_refs pointer to the array which stores the local reference string
corresponding to the global references (returned by localize).

int ndata number of global references.

int *n_off_proc address of the number of off-processor data elements (returned by
localize).

int my_size the size of my local array.

Return Value 
None 
Example 

Nodes 0 and 1 take part in a computation involving a loop which refers to
data residing off processor. The irregularly distributed arrays are x and y. Both
arrays have the same distribution pattern. Node 0 contains global indices 0, 1
and 2, while node 1 contains 3, 4, 5, 6 and 7. During the actual computation both
nodes 0 and 1 need to access certain elements of the y array. The global indices
that node 0 has to access are 3, 7 and 1, and node 1 has to access 4, 2, 3, 0 and 6.
We now present the inspector-executor code for the scenario described above.

#define BLOCKED 1

int i, ndata, indirection;
int n_off_proc;    /* set by localize */
int local[5], global_ref[5], local_ref[5];
double x[5], y[10];
TTABLE *tabptr;
SCHED *schedptr;

/* the following is the inspector code */

if (mynode() == 0) {
    local[0] = 0;
    local[1] = 1;
    local[2] = 2;
    ndata = 3;
    tabptr = build_translation_table(BLOCKED, local, ndata);

    global_ref[0] = 3;
    global_ref[1] = 7;
    global_ref[2] = 1;
    localize(tabptr, &schedptr, global_ref,
             local_ref, ndata, &n_off_proc, 3);
} else {
    local[0] = 3;
    local[1] = 4;
    local[2] = 5;
    local[3] = 6;
    local[4] = 7;
    ndata = 5;
    tabptr = build_translation_table(BLOCKED, local, ndata);

    global_ref[0] = 4;
    global_ref[1] = 2;
    global_ref[2] = 3;
    global_ref[3] = 0;
    global_ref[4] = 6;
    localize(tabptr, &schedptr, global_ref,
             local_ref, ndata, &n_off_proc, 5);
}

/* end of the inspector. Let us assign values to
   the distributed arrays */

for (i = 0; i < ndata; i++) {
    x[i] = i;
    y[i] = 2*i;
}

/* the following is the executor code */

/* off-processor values of y are stored after the ndata local elements */
dgather(schedptr, &y[ndata], y);

for (i = 0; i < ndata; i++) {
    indirection = local_ref[i];
    x[i] = x[i] + 3 * y[indirection];
}

/* end of the executor code */


After the end of the computation, the values of x[0], x[1] and x[2] on processor 0
are 0.0, 25.0 and 8.0 respectively. On processor 1 the values of x[0], x[1], x[2], x[3]
and x[4] are 6.0, 13.0, 2.0, 3.0 and 22.0 respectively. For a detailed example in
FORTRAN refer to Appendix A.
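
The index translation in this example can be mimicked inside a single process. The sketch below is not the PARTI implementation: owner_of, local_index_of, localize_sketch and run_example are hypothetical helpers, the ownership rule is hard-coded for the two-node example, and duplicate off-processor references are not merged. It only illustrates that off-processor references are renumbered to slots placed after the local data, and that a gather fills those slots before the executor loop runs.

```c
#define NNODES 2

/* Hypothetical ownership rule for the example: node 0 owns global
   indices 0..2, node 1 owns global indices 3..7. */
static int owner_of(int g)       { return g < 3 ? 0 : 1; }
static int local_index_of(int g) { return g < 3 ? g : g - 3; }

/* Sketch of localize: translate global references into indices into
   the local array. Off-processor references are numbered my_size,
   my_size+1, ... so that they address a buffer placed after the local
   data. fetch_proc/fetch_local record where each buffer slot comes
   from. Returns the number of off-processor references. */
static int localize_sketch(int me, const int *global_ref, int nref,
                           int my_size, int *local_ref,
                           int *fetch_proc, int *fetch_local)
{
    int n_off = 0;
    for (int i = 0; i < nref; i++) {
        int g = global_ref[i];
        if (owner_of(g) == me) {
            local_ref[i] = local_index_of(g);
        } else {
            fetch_proc[n_off]  = owner_of(g);
            fetch_local[n_off] = local_index_of(g);
            local_ref[i] = my_size + n_off;
            n_off++;
        }
    }
    return n_off;
}

/* Run inspector and executor for both nodes inside one process.
   x_out[p] receives the final x array of node p. */
static void run_example(double x_out[NNODES][5])
{
    static const int nlocal[NNODES]  = {3, 5};
    static const int gref[NNODES][5] = {{3, 7, 1}, {4, 2, 3, 0, 6}};
    double y[NNODES][10];
    int lref[NNODES][5], fproc[NNODES][5], floc[NNODES][5], noff[NNODES];

    for (int p = 0; p < NNODES; p++) {          /* inspector */
        noff[p] = localize_sketch(p, gref[p], nlocal[p], nlocal[p],
                                  lref[p], fproc[p], floc[p]);
        for (int i = 0; i < nlocal[p]; i++) {
            x_out[p][i] = i;
            y[p][i] = 2 * i;
        }
    }
    for (int p = 0; p < NNODES; p++)            /* stands in for dgather */
        for (int k = 0; k < noff[p]; k++)
            y[p][nlocal[p] + k] = y[fproc[p][k]][floc[p][k]];
    for (int p = 0; p < NNODES; p++)            /* executor loop */
        for (int i = 0; i < nlocal[p]; i++)
            x_out[p][i] += 3 * y[p][lref[p][i]];
}
```

Calling run_example fills x_out[0][0..2] and x_out[1][0..4] with the final x values of nodes 0 and 1 for the data used above.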


4 Calling the primitives from FORTRAN 

This section shows how the primitives can be used with FORTRAN. We will go
through the examples described in Section 3 using the FORTRAN version of the
PARTI primitives.


4.1 function ifschedule()

This function returns an integer which can be used to refer to the schedule
corresponding to the input data. This integer is used in gather, scatter and
scatter_FUNC operations (Sections 4.2, 4.3, 4.4).


Synopsis 

function ifschedule(ilocal,iproc,ndata) 

Parameter declarations 

integer ilocal() local indices to be gathered from or scattered to 
integer iproc() processors to be gathered from or scattered to 
integer ndata number of data elements involved in gather or scatter 

Return value 

Returns a reference to a schedule which can be used in PREFIXfgather,
PREFIXfscatter, PREFIXfscatter_add, PREFIXfscatter_sub and PREFIXfscatter_mult.

Example 

Node 0 schedules a fetch of elements 1 and 2 from a (so far unspecified) array on 
node 1; node 1 schedules a fetch of element 1 from an array on node 0 and 3 from 
an array on node 1. 


      integer ifschedule
      integer ilocal(2), iproc(2), ndata
      integer ischedinfo

      if (mynode() .eq. 0) then
         iproc(1) = 1
         ilocal(1) = 1
         iproc(2) = 1
         ilocal(2) = 2
         ndata = 2
      endif

      if (mynode() .eq. 1) then
         iproc(1) = 0
         ilocal(1) = 1
         iproc(2) = 1
         ilocal(2) = 3
         ndata = 2
      endif

      ischedinfo = ifschedule(ilocal,iproc,ndata)
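
Each processor only states what it wants to fetch, so a schedule must also tell every processor what it has to send. A minimal serial sketch of that inversion is given below, written in C. SendList and build_send_lists are hypothetical names, not the PARTI SCHED type or its interface, and the real primitive performs this exchange with messages rather than with shared arrays.

```c
#define NPROC 2
#define MAXREQ 8

/* Hypothetical per-processor record (not the PARTI SCHED type): which
   of its own elements a processor must send to each other processor. */
typedef struct {
    int nsend[NPROC];
    int send_index[NPROC][MAXREQ];
} SendList;

/* Build every processor's send list from every processor's request
   list. (req_proc[p][k], req_index[p][k]) is the k-th (processor,
   local index) pair requested by processor p; nreq[p] is the number
   of requests made by p. */
static void build_send_lists(int req_proc[NPROC][MAXREQ],
                             int req_index[NPROC][MAXREQ],
                             const int nreq[NPROC], SendList out[NPROC])
{
    for (int p = 0; p < NPROC; p++)
        for (int q = 0; q < NPROC; q++)
            out[p].nsend[q] = 0;

    for (int p = 0; p < NPROC; p++) {       /* requesting processor */
        for (int k = 0; k < nreq[p]; k++) {
            int src = req_proc[p][k];       /* processor holding the data */
            int n = out[src].nsend[p]++;
            out[src].send_index[p][n] = req_index[p][k];
        }
    }
}
```

With the requests of the example above, node 1 ends up with elements 1 and 2 to send to node 0 and element 3 to deliver to itself, while node 0 has element 1 to send to node 1.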


4.2 subroutine PREFIXfgather() 

PREFIX can be d (double precision), i (integer), f (real) or c (character). For more
information refer to Section 3.4.

Synopsis

subroutine PREFIXfgather(ischedinfo, buffer, aloc)

Parameter Declarations

integer ischedinfo refers to the relevant schedule

TYPE buffer() pointer to buffer for copies of gathered data values

TYPE aloc() location from which data is to be fetched from calling processor

Return Value

None


Example 

We assume that ifschedule has already been called with the parameters presented
in Section 4.1. Our example assumes that we wish to gather double precision
numbers, i.e. that we will be calling dfgather. On each processor, aloc points to
the array from which values are to be obtained, and buffer points to the location
into which copies of data values obtained from other processors will be placed.


      double precision buffer(2), aloc(3)
      integer ischedinfo

      do 10 i=1,3
         aloc(i) = mynode() + 0.1*i
 10   continue

      call dfgather(ischedinfo,buffer,aloc)


On processor 0, buffer(1) and buffer(2) are now equal to 1.1 and 1.2. On processor
1, buffer(1) and buffer(2) are now equal to 0.1 and 1.3.
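
The data movement of this gather can be checked with a serial stand-in written in C. gather_sketch and fill_example are hypothetical helpers that treat the two processors as rows of ordinary arrays; they illustrate the effect of the call, not how PARTI carries it out. Indices are kept 1-based to match the FORTRAN example.

```c
#define NPROC 2

/* Serial stand-in for a gather: for each processor p, copy the
   requested element of the owning processor's array into p's buffer.
   ilocal holds 1-based indices, as in the FORTRAN example. */
static void gather_sketch(double aloc[NPROC][3],
                          int iproc[NPROC][2], int ilocal[NPROC][2],
                          const int ndata[NPROC], double buffer[NPROC][2])
{
    for (int p = 0; p < NPROC; p++)
        for (int k = 0; k < ndata[p]; k++)
            buffer[p][k] = aloc[iproc[p][k]][ilocal[p][k] - 1];
}

/* Fill aloc as in the example: aloc(i) = mynode() + 0.1*i. */
static void fill_example(double aloc[NPROC][3])
{
    for (int p = 0; p < NPROC; p++)
        for (int i = 1; i <= 3; i++)
            aloc[p][i - 1] = p + 0.1 * i;
}
```

Running fill_example followed by gather_sketch with the request lists of Section 4.1 leaves each row of buffer holding the values listed above for that processor.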

4.3 subroutine PREFIXfscatter() 

PREFIX can be d (double precision), i (integer), f (real) or c (character). For more
information refer to Section 3.5.


Synopsis

subroutine PREFIXfscatter(ischedinfo, buffer, aloc)

Parameter Declarations

integer ischedinfo refers to the relevant schedule.

TYPE buffer() points to data values to be scattered from a given processor

TYPE aloc() points to first memory location on calling processor for scattered
data

Return Value

None

Example 

We assume that ifschedule has already been called with the parameters presented
in Section 4.1. Our example assumes that we wish to scatter double precision
numbers, i.e. that we will be calling dfscatter. On each processor, aloc points
to the array to which values are to be scattered, and buffer points to the location
holding the data that will be scattered. The processor and local array
index to which each value is scattered were designated during the earlier call
to ifschedule.


      double precision buffer(2), aloc(3)
      integer ischedinfo

      do 10 i=1,3
         aloc(i) = 10.0
 10   continue

      if (mynode() .eq. 0) then
         buffer(1) = 444.44
         buffer(2) = 555.55
      endif

      if (mynode() .eq. 1) then
         buffer(1) = 666.66
         buffer(2) = 777.77
      endif

      call dfscatter(ischedinfo,buffer,aloc)


On processor 0, the first three elements of aloc are 666.66, 10.0 and 10.0. On
processor 1, the first three elements of aloc are 444.44, 555.55 and 777.77.
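
A scatter moves data in the opposite direction to a gather: the same (processor, local index) pairs now name destinations. The serial C sketch below makes this explicit. scatter_sketch is a hypothetical helper that models the two processors as rows of an array, and it says nothing about how PARTI orders or packs its messages.

```c
#define NPROC 2

/* Serial stand-in for a scatter: processor p writes buffer[p][k] into
   element ilocal[p][k] (1-based) of the array on processor
   iproc[p][k]. */
static void scatter_sketch(double aloc[NPROC][3],
                           int iproc[NPROC][2], int ilocal[NPROC][2],
                           const int ndata[NPROC], double buffer[NPROC][2])
{
    for (int p = 0; p < NPROC; p++)
        for (int k = 0; k < ndata[p]; k++)
            aloc[iproc[p][k]][ilocal[p][k] - 1] = buffer[p][k];
}
```

In this sketch, if two entries named the same destination the later write would win. The example avoids that case because every destination appears only once.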


4.4 subroutine PREFIXfscatter_FUNC() 

PREFIX can be d (double precision), i (integer), f (real) or c (character). For more
information refer to Section 3.6.


Synopsis

subroutine PREFIXfscatter_FUNC(ischedinfo, buffer, aloc)

Parameter Declarations

integer ischedinfo refers to the relevant schedule.

TYPE buffer() points to data values that will form operands for the specified
type of remote operation.

TYPE aloc() points to first memory location on calling processor to be used as
targets of remote operations.

Return Value

None

Example

We assume that ifschedule has already been called with the parameters presented
in Section 4.1. Our example assumes that we wish to scatter and add double
precision numbers, i.e. that we will be calling dfscatter_add. On each processor,
aloc points to the array to which values are to be scattered and added, and buffer
points to the location holding the values to be scattered and
added. The processor and local array index to which each value is scattered
and added were designated during the earlier call to ifschedule.


      double precision buffer(2), aloc(3)
      integer ischedinfo

      do 10 i=1,3
         aloc(i) = 10.0
 10   continue

      if (mynode() .eq. 0) then
         buffer(1) = 444.44
         buffer(2) = 555.55
      endif

      if (mynode() .eq. 1) then
         buffer(1) = 666.66
         buffer(2) = 777.77
      endif

      call dfscatter_add(ischedinfo,buffer,aloc)
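
What separates scatter_FUNC from a plain scatter is that each arriving value is combined with the value already stored at the target. The serial C sketch below shows that step with the operation passed as a function pointer. combine_fn, op_add, op_sub, op_mult and scatter_func_sketch are hypothetical names, and the operand order used for subtraction (target minus value) is an assumption rather than something taken from the PARTI sources.

```c
#define NPROC 2

/* Combines the value already at the target with the arriving value. */
typedef double (*combine_fn)(double target, double value);

static double op_add(double t, double v)  { return t + v; }
static double op_sub(double t, double v)  { return t - v; }  /* assumed order */
static double op_mult(double t, double v) { return t * v; }

/* Serial stand-in for scatter_FUNC: like a scatter, but the arriving
   value is combined with the target element instead of replacing it.
   ilocal holds 1-based indices, as in the FORTRAN example. */
static void scatter_func_sketch(combine_fn f, double aloc[NPROC][3],
                                int iproc[NPROC][2], int ilocal[NPROC][2],
                                const int ndata[NPROC],
                                double buffer[NPROC][2])
{
    for (int p = 0; p < NPROC; p++)
        for (int k = 0; k < ndata[p]; k++) {
            double *target = &aloc[iproc[p][k]][ilocal[p][k] - 1];
            *target = f(*target, buffer[p][k]);
        }
}
```

Passing op_add with the data of this example corresponds to the dfscatter_add call; op_sub and op_mult play the same role for the _sub and _mult variants.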


On processor 0, the first three elements of aloc are 676.66, 10.0 and 10.0. On
processor 1, the first three elements of aloc are 454.44, 565.55 and 787.77.


4.5 function ifbuild_translation_table()

For detailed information refer to Section 3.7.

Synopsis 

function ifbuild_translation_table(part, indexarray, ndata)

Parameter Declarations

integer part how the translation table will be mapped; may be BLOCKED or
STRIPED

integer indexarray() each processor P specifies the list of globally numbered indices
for which P will be responsible

integer ndata number of indices for which processor P will be responsible

Return Value

An integer which refers to the translation table corresponding to the input data.

Example

An example demonstrating the use of both build_translation_table and dereference
can be found in Section 4.7.
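
The part argument decides which processor stores the table entry for a given global index. The C sketch below shows one plausible reading of the two options. blocked_home and striped_home are hypothetical functions, and the 0-based indexing, the block size (the ceiling of n/nproc) and the round-robin rule are all assumptions made for illustration, since this section does not spell them out.

```c
/* Processor holding the table entry for global index g (0-based) when
   n entries are split into contiguous blocks over nproc processors.
   The block size used here, ceil(n / nproc), is an assumption. */
static int blocked_home(int g, int n, int nproc)
{
    int block = (n + nproc - 1) / nproc;
    return g / block;
}

/* Processor holding the entry for global index g when entries are
   dealt out round-robin (assumed meaning of STRIPED). */
static int striped_home(int g, int nproc)
{
    return g % nproc;
}
```

Under these assumptions, with 8 entries on 2 processors a blocked table keeps entries 0 to 3 on processor 0 and entries 4 to 7 on processor 1, while a striped table alternates between the two.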


4.6 subroutine flocalize() 

For more information refer to Section 3.9.

Synopsis

subroutine flocalize(itabptr, ilsched, iglobal_refs, ilocal_refs, ndata, n_off_proc, my_size)

Parameter Declarations

integer itabptr refers to the relevant translation table pointer.

integer ilsched refers to the relevant schedule pointer (returned by localize).

integer iglobal_refs() the array which stores the global reference string.

integer ilocal_refs() the array which stores the local reference string corresponding
to the global references (returned by localize).

integer ndata number of global references.

integer n_off_proc number of off-processor data (returned by localize).

integer my_size the size of my local array.

Return Value

None

Example

Nodes 0 and 1 take part in a computation involving a loop that refers to
data residing off processor. The inspector and executor code is presented here.


      integer i, ndata, indirection
      integer ilocal(5), iglobal_ref(5), ilocal_ref(5)
      double precision x(5), y(10)
      integer itabptr
      integer ischedptr
      integer BLOCKED
      integer ifbuild_translation_table

c     the following is the inspector code

      BLOCKED = 1

      if (mynode() .eq. 0) then
         ilocal(1) = 1
         ilocal(2) = 2
         ilocal(3) = 3
         ndata = 3
         mysize = 3

         itabptr = ifbuild_translation_table(BLOCKED,ilocal,ndata)
         iglobal_ref(1) = 4
         iglobal_ref(2) = 8
         iglobal_ref(3) = 2

         call flocalize(itabptr,ischedptr,iglobal_ref,
     &                  ilocal_ref,ndata,n_off_proc,mysize)
      else
         ilocal(1) = 4
         ilocal(2) = 5
         ilocal(3) = 6
         ilocal(4) = 7
         ilocal(5) = 8
         ndata = 5
         mysize = 5

         itabptr = ifbuild_translation_table(BLOCKED,ilocal,ndata)
         iglobal_ref(1) = 5
         iglobal_ref(2) = 3
         iglobal_ref(3) = 4
         iglobal_ref(4) = 1
         iglobal_ref(5) = 7

         call flocalize(itabptr,ischedptr,iglobal_ref,
     &                  ilocal_ref,ndata,n_off_proc,mysize)
      endif

c     keep the localized references in iglobal_ref

      do 10 i=1,ndata
         iglobal_ref(i) = ilocal_ref(i)
 10   continue

c     end of the inspector. Let us assign values to
c     the distributed arrays. Each element of y is set to twice
c     its global index, which ilocal holds for the owned elements.

      do 20 i=1,ndata
         x(i) = i
         y(i) = 2*ilocal(i)
 20   continue

c     the following is the executor code. Off-processor values
c     are gathered into y starting at element ndata+1.

      call dfgather(ischedptr,y(ndata+1),y(1))

      do 30 i=1,ndata
         indirection = iglobal_ref(i)
         x(i) = x(i) + 3 * y(indirection)
 30   continue

c     end of the executor code


After the end of the computation, the values of x(1), x(2) and x(3) on processor 0 are
25.0, 50.0 and 15.0 respectively. On processor 1 the values of x(1), x(2), x(3), x(4)
and x(5) are 31.0, 20.0, 27.0, 10.0 and 47.0 respectively. For a detailed example in
FORTRAN refer to Appendix A.


4.7 subroutine fdereference() 

For more information about this section refer to Section 3.8. 

Synopsis 

subroutine fdereference(index_table, global, local, proc, ndata)

Parameter declarations

integer index_table refers to the relevant translation table

integer global() list of global indices we wish to locate in distributed memory

integer local() local indices obtained from the distributed translation table that
correspond to the global indices passed to dereference

integer proc() array of distributed translation table processor assignments for
each global index passed to dereference

integer ndata number of elements to be dereferenced

Table 2: Values obtained by dereference

Processor   proc(1)   local(1)   proc(2)   local(2)
    0          0         1          1         1
    1          1         2          0         2

Return value 
None 
Example 

A one dimensional distributed array is partitioned in some irregular manner, so we
need a distributed translation table to keep track of where one can find the value
of a given element of the distributed array.

In the example below, we initialize a translation table. Processor 0 calls
build_translation_table and assigns indices 1 and 4 to processor 0; processor 1 calls
build_translation_table and assigns indices 2 and 3 to processor 1. The translation
table is partitioned between processors in blocks.

Processor 0 then uses the translation table to dereference global indices 1 and 2, and
processor 1 uses the translation table to dereference global indices 3 and 4. On
each processor, dereference carries out a translation table lookup. The values of
proc and local returned by dereference are shown in Table 2. The user
specifies the processor to which each global index is assigned; note, however, that
build_translation_table assigns the local indices.


c 


      program dref

      integer size, i, index_array(2)
      integer ideref_array(2)
      integer ilocal(2), iproc(2)
      integer BLOCKED
      integer ifbuild_translation_table

c     Assign indices 1 and 4 to processor 0

      if (mynode() .eq. 0) then
         index_array(1) = 1
         index_array(2) = 4
      endif

c     Assign indices 2 and 3 to processor 1

      if (mynode() .eq. 1) then
         index_array(1) = 2
         index_array(2) = 3
      endif

c     set up a translation table

      BLOCKED = 1
      size = 2

      itable = ifbuild_translation_table(BLOCKED,index_array,size)

c     Processor 0 seeks processor and local indices
c     for global array indices 1 and 2

      if (mynode() .eq. 0) then
         ideref_array(1) = 1
         ideref_array(2) = 2
      endif

c     Processor 1 seeks processor and local indices
c     for global array indices 3 and 4

      if (mynode() .eq. 1) then
         ideref_array(1) = 3
         ideref_array(2) = 4
      endif

c     Dereference a set of global indices

      call fdereference(itable,ideref_array,ilocal,iproc,size)

c     iproc and ilocal return the processors and local indices where
c     global array indices are stored (see Table 2).

c     In processor 0, iproc(1)=0, iproc(2)=1, ilocal(1)=1, ilocal(2)=1
c     In processor 1, iproc(1)=1, iproc(2)=0, ilocal(1)=2, ilocal(2)=2
      stop
      end
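
The lookup carried out by the program above can be reproduced serially in C. In the sketch below the translation table is a flat array held in one place, whereas the PARTI table is distributed across processors; Table, build_table_sketch and dereference_sketch are hypothetical names. The rule that a local index is the 1-based position of the global index in the owner's list is inferred from Table 2.

```c
#define NPROC 2
#define NGLOBAL 4

/* Hypothetical flat translation table: entry g-1 describes global
   index g (global indices are 1-based, as in the FORTRAN example). */
typedef struct {
    int proc[NGLOBAL];
    int local[NGLOBAL];
} Table;

/* Each processor p contributes the global indices it owns; the local
   index is the 1-based position of the index in p's list. */
static void build_table_sketch(Table *t, int index_array[NPROC][2], int size)
{
    for (int p = 0; p < NPROC; p++)
        for (int k = 0; k < size; k++) {
            int g = index_array[p][k];
            t->proc[g - 1]  = p;
            t->local[g - 1] = k + 1;
        }
}

/* Look up the owner and local index of each requested global index. */
static void dereference_sketch(const Table *t, const int *global,
                               int *local, int *proc, int ndata)
{
    for (int k = 0; k < ndata; k++) {
        proc[k]  = t->proc[global[k] - 1];
        local[k] = t->local[global[k] - 1];
    }
}
```

Building the table from the lists {1, 4} and {2, 3} and then dereferencing {1, 2} and {3, 4} gives the two rows of Table 2.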


Now assume that processor 0 needs to know the values of distributed array elements
1, 2 and 4, while processor 1 needs to know the value of element 3. We call dereference
to find the processors and the local indices that correspond to each global
index. At this point schedule can be called and gathers and scatters carried out.


A Sweep over the Edges of an Unstructured Mesh

This code can be found in the directory examples/unst. It goes through the whole
process of setting up the inspector, after which the subroutine executor is called to do the
actual computation. A driver program is included in the distribution
but is not listed in this section. The executor is a loop taken from
a real CFD code, where the loop is over the edges of the mesh. If we remove the calls
to gather and scatter_add from the subroutine executor, the code looks
identical to the sequential version.


c 

c     The subroutines inspector and executor for a sweep over an
c     arbitrary unstructured mesh are shown below.
c
c     There is a driver code which calls these two subroutines after
c     reading in the mesh structure and initialization data. This
c     shows how the different PARTI primitives can be called
c     from FORTRAN.
c
c ---------------------------------------------------------------
      subroutine inspector(ledge,myvals,nde)
c ---------------------------------------------------------------
c
#include "commonl.F"
c
      common/node/ ntotnodes, nonode, noedge
      common/sched/ lesched
      common/offproc/ ne_off_proc
c
      integer nde(ledge,2)
      integer myvals(nonode)
c
c     Local Variables
c
      integer ig_ref_e(nge)
      integer locale(nge)
      integer ifbuild_translation_table
c
c     Build the translation table
c
      itabptr = ifbuild_translation_table(1,myvals,nonode)
c
c     Setup global references for edge loop
c
      do 20 i = 1, noedge
         ig_ref_e(i) = nde(i,1)
         ig_ref_e(noedge+i) = nde(i,2)
 20   continue
      iecount = 2 * noedge
c
c     Setup schedule and change global ref. to local ref.
c
      call flocalize(itabptr,lesched,ig_ref_e,locale,
     &               iecount,ne_off_proc,nonode)
c
      do 40 i = 1, noedge
         nde(i,1) = locale(i)
         nde(i,2) = locale(noedge+i)
 40   continue
c
      return
      end
c
c ---------------------------------------------------------------
      subroutine executor(ledge,lnode,nde,gnorm,w,p,dtl,iflop)
c ---------------------------------------------------------------
c
      real*8 rm,al,yaw,gamma,rho0,p0,ei0,h0,c0,u0,v0,w0
      real*8 cfl,bc,vis0,vis1,vis2,hm,smoop
c
      common/node/ ntotnodes, nonode, noedge
      common/sched/ lesched
      common/offproc/ ne_off_proc
      common/tsp/ cfl,bc,vis0,vis1,vis2,hm,smoop,ncycsm
      common/flw/ rm,al,yaw,gamma,rho0,p0,ei0,h0,c0,u0,v0,w0
c
      integer nde(ledge,2)
      real*8 gnorm(ledge,5)
      real*8 dtl(lnode)
      real*8 w(lnode,5),p(lnode)
c
c --- Local variables
c
      real*8 cc1,cc2,cs1,cs2,a1,a2,qs,flux1,flux2
c
c --- Initialize Time Step
c
      do 50 i=1,nonode
         dtl(i) = 0.0D0
 50   continue
c
c --- Do all the Gathers
c
      do 60 kk = 1,4
         call dfgather(lesched,w(nonode+1,kk),w(1,kk))
 60   continue
      call dfgather(lesched,p(nonode+1),p(1))
      do 63 i = 1,ne_off_proc
         dtl(nonode+i) = 0.0D0
 63   continue
c
c --- Compute Field Time-Steps Using Edge Format
c
      do 500 i=1,noedge
         n1 = nde(i,1)
         n2 = nde(i,2)
         cc1 = dsqrt(gamma*p(n1)/w(n1,1))
         cc2 = dsqrt(gamma*p(n2)/w(n2,1))
         cs1 = cc1*gnorm(i,4)
         cs2 = cc2*gnorm(i,5)
         a1 = (gnorm(i,1)*w(n1,2) + gnorm(i,2)*w(n1,3)
     &         + gnorm(i,3)*w(n1,4)) / w(n1,1)
         a2 = (gnorm(i,1)*w(n2,2) + gnorm(i,2)*w(n2,3)
     &         + gnorm(i,3)*w(n2,4)) / w(n2,1)
         qs = (a1 + a2) / 2.0D0
         flux1 = dabs(qs) + cs1
         flux2 = dabs(qs) + cs2
         dtl(n1) = dtl(n1) + flux2
         dtl(n2) = dtl(n2) + flux1
 500  continue
      iflop = iflop + (noedge * 28)
c
c --- Do all the Scatters
c
      call dfscatter_add(lesched,dtl(nonode+1),dtl(1))
c
      return
      end
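
The executor follows a fixed pattern: gather off-processor values into slots placed after the owned data, run the unchanged edge loop on local indices, then scatter_add the contributions accumulated in those slots back to their owners. The C sketch below emulates this pattern for two processors inside one process. The four-node ring mesh, the product used as a flux, and the names edge_sweep, sequential_sweep and partitioned_sweep are all made up for illustration and are not taken from the CFD code.

```c
#define NNODE 4

/* The edge loop: identical for the sequential and partitioned runs. */
static void edge_sweep(int nedge, int nde[][2], const double *w, double *acc)
{
    for (int e = 0; e < nedge; e++) {
        int n1 = nde[e][0], n2 = nde[e][1];
        double flux = w[n1] * w[n2];        /* made-up edge quantity */
        acc[n1] += flux;
        acc[n2] += flux;
    }
}

/* Reference result on the whole mesh (a ring of four nodes). */
static void sequential_sweep(const double *wg, double *accg)
{
    int edges[4][2] = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
    for (int i = 0; i < NNODE; i++) accg[i] = 0.0;
    edge_sweep(4, edges, wg, accg);
}

/* Two emulated processors. Processor 0 owns global nodes 0,1 and
   handles edges (0,1),(1,2); processor 1 owns 2,3 and handles edges
   (2,3),(3,0). Edges are already in local numbering: 0 and 1 are
   owned nodes, 2 is the slot for the off-processor node. */
static void partitioned_sweep(const double *wg, double *accg)
{
    int edges[2][2][2] = {{{0, 1}, {1, 2}}, {{0, 1}, {1, 2}}};
    int owned[2][2] = {{0, 1}, {2, 3}};
    int ghost[2] = {2, 0};      /* global node held in slot 2 */
    double w[2][3], acc[2][3];

    for (int p = 0; p < 2; p++) {
        for (int i = 0; i < 2; i++) {
            w[p][i] = wg[owned[p][i]];
            acc[p][i] = 0.0;
        }
        w[p][2] = wg[ghost[p]];             /* stands in for gather */
        acc[p][2] = 0.0;                    /* zero the off-processor slot */
        edge_sweep(2, edges[p], w[p], acc[p]);
    }
    for (int p = 0; p < 2; p++)             /* owned results */
        for (int i = 0; i < 2; i++)
            accg[owned[p][i]] = acc[p][i];
    for (int p = 0; p < 2; p++)             /* stands in for scatter_add */
        accg[ghost[p]] += acc[p][2];
}
```

Both routines produce the same accumulated values, which is the property that lets the executor keep the sequential loop body unchanged.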


B Example: Sparse matrix multiplication

The following example of symmetric matrix vector multiplication can be found in the
file matmult.c in the examples/sparse_mat_mult directory. There is a host program
which is present in the same directory but has not been listed here. The sparse matrix
is obtained from the host program using the function get_sparse_mat(). Then we go
through the pre-processing to generate all the fetch lists and build a schedule to bring
in off-processor data. Lastly, the matrix multiplication procedure spmvm() is called.
After the multiplication the values are scattered using the primitive scatter_add.
/****************************************************************/
/* PARTI program to do a sparse matrix-vector multiplication    */
/*                                                              */
/* This program reads in a sparse matrix with the help of       */
/* the host program and does a matrix vector multiplication.    */
/* This is a listing of the node program and it is run by the   */
/* host program. This program:                                  */
/*                                                              */
/* 1) gets unstructured mesh (w/ help from host program)        */
/* 2) does lots of memory and address stuff on it               */
/* 3) generates a vector x                                      */
/* 4) multiplies x by the matrix, getting y                     */
/*                                                              */
/****************************************************************/

#include <cube.h>
#include <stdio.h>
#include <math.h>
#include "parti.h"
#include "main.h"

main(argc,argv)
int argc;
char *argv[];
{
    int i, j, count;
    TTABLE *table;
    SCHED *sr;
    double *x, *y, *z;

    /*
     * ------------------------------------------------------------
     * Get sparse matrix from host program.
     * ------------------------------------------------------------
     */
    get_sparse_mat();

    /*
     * ------------------------------------------------------------
     * Build translation table by scattering Row to the table.
     * IN: Row[i]    OUT: table
     * ------------------------------------------------------------
     */
    table = build_translation_table(BLOCKED, Row, Myrows);

    /*
     * ------------------------------------------------------------
     * Look up address of Cols and put them in Local and Proc.
     * IN: Cols[i], table    OUT: Local[i], Proc[i]
     * ------------------------------------------------------------
     */
    dereference(table, Cols, Local, Proc, Mynonzeros);

    /*
     * ------------------------------------------------------------
     * Loop through all proc/offset pairs and decide which
     * must be fetched from other processors.
     * IN: Local[i], Proc[i]    OUT: Fetch_l[i], Fetch_p[i]
     * ------------------------------------------------------------
     */
    gen_fetch_list();

    /*
     * ------------------------------------------------------------
     * Allocate memory for vectors, and set x[i] = 1.0 for local i.
     * ------------------------------------------------------------
     */
    x = (double *) malloc(sizeof(double)*Myrows);
    y = (double *) malloc(sizeof(double)*Myrows);

    for(i=0; i<Myrows; i++) x[i] = 1.0;

    /*
     * ------------------------------------------------------------
     * build communications schedule
     * IN: Fetch_l[i], Fetch_p[i]    OUT: sr
     * ------------------------------------------------------------
     */
    sr = schedule(Fetch_l, Fetch_p, Nfetch);

    /*
     * ------------------------------------------------------------
     * Perform sparse-matrix vector multiplication.
     * ------------------------------------------------------------
     */
    spmvm(sr, x, y);
}

/* END OF NODE PROGRAM */

/*
 * ------------------------------------------------------------
 * This function is used to read in the sparse mat.
 * It should be ignored if at all possible.
 * ------------------------------------------------------------
 */
get_sparse_mat()
{
    int size, indx_buffer[BUFFER_SIZE];
    double coef_buffer[BUFFER_SIZE];
    int type, rows_expected;

    rows_expected = -1;
    Myrows = 0;
    Mynonzeros = 0;

    gsync();

    while( (Myrows<rows_expected) || (rows_expected<0) ){
        cprobe(-1);
        type = infotype();
        size = infocount()/sizeof(int);
        if( type==ROW_INDX_MSG ){
            crecv(ROW_INDX_MSG, indx_buffer, size*sizeof(int));
            crecv(ROW_COEF_MSG, coef_buffer, size*sizeof(double));
            unpack_row_data(indx_buffer, coef_buffer, size);
        }
        if( type==SETUP_MSG ){
            crecv(SETUP_MSG, indx_buffer, size*sizeof(int));
            rows_expected = indx_buffer[mynode()];
            Nrows = indx_buffer[numnodes()];
        }
    }

    gsync();
}

/*
 * ------------------------------------------------------------
 * The buffers are unpacked in the following
 * procedure
 * ------------------------------------------------------------
 */
unpack_row_data(indx_buffer, coef_buffer, size)
int *indx_buffer, size;
double *coef_buffer;
{
    int count, i, j, row, ncols, count2, ixx, ist;
    double sum;
    static int col_count = 0;

    for( count=0; count<size; ){
        Row[Myrows] = indx_buffer[count];
        Diags[Myrows] = coef_buffer[count];
        sum = Diags[Myrows];
        ncols = Ncols[Myrows] = indx_buffer[count+1];
        count = count+2;
        Mynonzeros += ncols;

        if( Myrows >= MAX_ROWS ){
            fprintf(stderr, "Error on node %d : too many rows !!!\n", mynode());
            exit();
        }
        if( Mynonzeros >= MAX_NONZEROS ){
            fprintf(stderr, "Error on node %d : too many nonzeros !!!\n",
                    mynode());
            exit();
        }

        for( j=0; j<ncols; j++ ){
            Cols[col_count] = indx_buffer[count];
            Vals[col_count] = coef_buffer[count];
            sum += Vals[col_count];
            col_count++;
            count++;
        }
        Myrows++;
    }
}

/*
 * ------------------------------------------------------------
 * This function takes the Local[i], Proc[i]
 * address for each nonzero col in the matrix
 * and puts nonlocal ones into Fetch_l[i], Fetch_p[i]
 * ------------------------------------------------------------
 */
gen_fetch_list()
{
    int count, i, myproc;

    myproc = mynode();

    /* count offnode refs. */
    Nfetch = 0;
    for( i=0; i<Mynonzeros; i++ ) Nfetch += (Proc[i] != myproc);

    /* for each ref. */
    Fetch_p = (int *) malloc(sizeof(int)*Nfetch*2);
    Fetch_l = &Fetch_p[Nfetch];
    count = 0;

    for( i=0; i<Mynonzeros; i++ ){
        if( Proc[i] != myproc ){
            /* if Col[i] refers to an off-proc location.. */
            Fetch_p[count] = Proc[i];    /* add it to the fetch list */
            Fetch_l[count] = Local[i];
            count++;
        }
    }
}

/*
 * ------------------------------------------------------------
 * sparse matrix vector multiply function !
 * require that the schedule be built and passed in
 * ------------------------------------------------------------
 */
spmvm(sr,x,y)
SCHED *sr;          /* communication schedule */
double *x, *y;      /* input and result vectors */
{
    int myproc, bcount, count, i, j;
    double tmp, *buffer, *ybuffer;

    /* Allocate local buffer to gather data into. */
    buffer = (double *) malloc(sizeof(double)*Nfetch);

    /* Allocate local buffer to store output vector values into. */
    ybuffer = (double *) malloc(sizeof(double)*Nfetch);

    /* Gather data using previously computed communication schedule. */
    dgather(sr, buffer, x);

    myproc = mynode();
    bcount = 0;
    count = 0;

    for( i=0; i<Myrows; i++ ) y[i] = 0.0;
    for( i=0; i<Nfetch; i++ ) ybuffer[i] = 0.0;

    for( i=0; i<Myrows; i++ ){
        y[i] += Diags[i]*x[i];
        for( j=0; j<Ncols[i]; j++ ){
            /* for each nonzero col .... */
            if( Proc[count] == myproc ){
                /* if col[count] is local */
                y[i] += x[Local[count]]*Vals[count];
                y[Local[count]] += x[i]*Vals[count];
            } else {
                /* otherwise look in buffer */
                y[i] += buffer[bcount]*Vals[count];
                ybuffer[bcount] += x[i]*Vals[count];
                bcount++;
            }
            count++;
        }
    }

    dscatter_add(sr, ybuffer, y);
    gsync();

    for( i=0; i<Myrows; i++ ){
        fprintf(myfile, "after scatter processor %d, y[%d] = %lf\n",
                myproc, i, y[i]);
        fflush(myfile);
    }

    free(buffer);
    free(ybuffer);
}